
Clustering of illustrations by atmosphere using a combination of supervised and unsupervised learning

Kubota, Keisuke, Okuda, Masahiro

arXiv.org Artificial Intelligence

The distribution of illustrations on social media, such as Twitter and Pixiv, has increased with the growing popularity of animation, games, and animated movies. The "atmosphere" of an illustration plays an important role in user preferences, so classifying illustrations by atmosphere can be helpful for recommendation and search. However, the elusive "atmosphere" is difficult to label clearly, making conventional supervised classification impractical. Furthermore, even images with similar colors, edges, and other low-level features may not share a similar atmosphere, making classification based on low-level features challenging. In this paper, this problem is addressed using both supervised and unsupervised learning with pseudo-labels: feature vectors are obtained by a supervised method using pseudo-labels that capture the ambiguous atmosphere, and clustering is then performed on these feature vectors. Experimental analyses show that our method outperforms conventional methods in human-like clustering on datasets manually classified by humans.


Asymptotics for The $k$-means

Zhang, Tonglin

arXiv.org Artificial Intelligence

Clustering is one of the most important unsupervised learning techniques for understanding underlying data structures. The goal is to partition a data set into subsets, called clusters, such that observations within a subset are the most homogeneous and observations between subsets are the most heterogeneous. Clustering is usually carried out by specifying a similarity or dissimilarity measure between observations. Examples include the k-means [17, 19, 29, 37], the k-medians [3], the k-modes [5], and the generalized k-means [2, 31, 45], as well as many of their modifications [21, 24, 42]. Among these, the k-means has been considered one of the most straightforward and popular methods since it was proposed sixty years ago [23, 36]. Although it is well known, the investigation of its theoretical properties still lags far behind, leading to difficulties in developing more precise k-means methods in practice. The goal of the present research is to propose a new concept, called clustering consistency, for the asymptotics of the k-means, with a resulting clustering method better than the existing k-means methods adopted by many software packages, including those in R and Python.
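The k-means procedure the abstract builds on can be sketched in a few lines. The following is a minimal illustration of Lloyd's algorithm on synthetic data (the function names and toy data are ours, not the paper's):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: alternate nearest-center assignment
    and centroid updates until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each observation to its nearest center
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        # move each center to the mean of its assigned observations
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# two well-separated synthetic blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
centers, labels = kmeans(X, k=2)
```

On data this cleanly separated, the algorithm recovers one center per blob; on harder data, results depend on initialization, which is one motivation for studying the method's asymptotic properties.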


Efficiency Evaluation of Banks with Many Branches using a Heuristic Framework and Dynamic Data Envelopment Optimization Approach: A Real Case Study

Kayvanfar, Vahid, Baziyad, Hamed, Sheikh, Shaya, Werner, Frank

arXiv.org Artificial Intelligence

Evaluating the efficiency of organizations, and of branches within an organization, is a challenging issue for managers. Evaluation criteria allow organizations to rank their internal units, identify their position relative to competitors, and implement strategies for improvement and development. Among the methods applied to the evaluation of bank branches, non-parametric methods have captured the attention of researchers in recent years. One of the most widely used non-parametric methods is data envelopment analysis (DEA), which leads to promising results. However, static DEA approaches do not consider time in the model. Therefore, this paper uses a dynamic DEA (DDEA) method to evaluate the branches of a private Iranian bank over three years (2017-2019). The results are then compared with static DEA. After ranking the branches, they are clustered using the K-means method. Finally, a comprehensive sensitivity analysis approach is introduced to help managers decide which variables to change in order to shift a branch from one cluster to a more efficient one.
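As a rough illustration of the final clustering step, the sketch below groups hypothetical efficiency scores into tiers with a small k-means implementation. The scores, the 12-branch setup, and the three-cluster choice are our assumptions for illustration, not figures from the study:

```python
import numpy as np

# hypothetical DEA efficiency scores for 12 branches (illustrative only)
scores = np.array([0.95, 0.92, 0.90, 0.74, 0.71, 0.70, 0.69,
                   0.45, 0.43, 0.41, 0.40, 0.38]).reshape(-1, 1)

def kmeans(X, k, iters=50):
    # farthest-first seeding: start from the first point and repeatedly
    # add the point farthest from the centers chosen so far, which avoids
    # empty clusters on small, well-separated data
    centers = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None] - np.array(centers)) ** 2).sum(-1).min(axis=1)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels

labels = kmeans(scores, k=3)  # e.g. high / medium / low efficiency tiers
```

The resulting labels split the branches into three efficiency tiers; a sensitivity analysis like the one in the paper would then ask which input or output variables a branch must change to move to a better tier.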


Aerodynamic Data Predictions Based on Multi-task Learning

Hu, Liwei, Xiang, Yu, Zhan, Jun, Shi, Zifang, Wang, Wenzheng

arXiv.org Artificial Intelligence

The quality of datasets is one of the key factors affecting the accuracy of aerodynamic data models. For example, in the uniformly sampled Burgers' dataset, the scarce high-speed data is overwhelmed by massive low-speed data. Predicting high-speed data is more difficult than predicting low-speed data because the number of high-speed samples is limited, i.e., the quality of the Burgers' dataset is not satisfactory. To improve dataset quality, traditional methods usually employ data resampling to produce enough data for the under-represented parts of the original dataset before modeling, which increases computational costs. Recently, mixtures of experts have been used in natural language processing to deal with different parts of sentences, which suggests a way to eliminate the need for data resampling in aerodynamic data modeling. Motivated by this, we propose multi-task learning (MTL), a dataset-quality-adaptive learning scheme that combines task allocation and aerodynamic characteristics learning to distribute the burden of the overall learning task. The task allocation divides the whole learning task into several independent subtasks, while the aerodynamic characteristics learning learns these subtasks simultaneously to achieve better precision. Two experiments with poor-quality datasets are conducted to verify the data-quality-adaptivity of the MTL. The results show that the MTL is more accurate than FCNs and GANs on poor-quality datasets.


Too Much Information Kills Information: A Clustering Perspective

Xu, Yicheng, Chau, Vincent, Wu, Chenchen, Zhang, Yong, Zissimopoulos, Vassilis, Zou, Yifei

arXiv.org Machine Learning

Clustering is one of the most fundamental tools in artificial intelligence, particularly in pattern recognition and learning theory. In this paper, we propose a simple but novel approach for variance-based k-clustering tasks, including the widely known k-means clustering. The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data in that subset only. Under certain assumptions, the resulting clustering provably estimates the optimum of the variance-based objective well with high probability. Extensive experiments on synthetic and real-world datasets show that to obtain results competitive with the k-means method (Lloyd 1982) and the k-means++ method (Arthur and Vassilvitskii 2007), we only need 7% of the information in the dataset. If we have up to 15% of the information in the dataset, then our algorithm outperforms both the k-means and k-means++ methods in at least 80% of the clustering tasks, in terms of the quality of clustering. Also, an extended algorithm based on the same idea guarantees a balanced k-clustering result.
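The sampling idea can be illustrated with a simplified sketch (this is our own toy experiment, not the authors' algorithm or guarantees): cluster only a small random subset, then evaluate the resulting centers on the full dataset against centers fitted on all the data.

```python
import numpy as np

def farthest_first(X, k):
    # simple seeding: repeatedly add the point farthest from the
    # centers chosen so far
    centers = [X[0]]
    for _ in range(k - 1):
        d = ((X[:, None] - np.array(centers)) ** 2).sum(-1).min(axis=1)
        centers.append(X[d.argmax()])
    return np.array(centers)

def kmeans(X, k, iters=50):
    centers = farthest_first(X, k)
    for _ in range(iters):
        labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers

def cost(X, centers):
    # variance-based objective: total squared distance to nearest center
    return ((X[:, None] - centers) ** 2).sum(-1).min(axis=1).sum()

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.4, (200, 2)) for c in (0.0, 4.0, 8.0)])

# fit on a 10% random sample, then score those centers on the full data
sample = X[rng.choice(len(X), size=len(X) // 10, replace=False)]
ratio = cost(X, kmeans(sample, 3)) / cost(X, kmeans(X, 3))
```

On well-separated synthetic clusters like these, the subset-fitted centers achieve an objective value very close to the full-data fit, which is the intuition behind needing only a small fraction of the dataset's information.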


K-Means Clustering for Unsupervised Machine Learning

#artificialintelligence

Artificial Intelligence (AI) and Machine Learning (ML) have revolutionized every aspect of our lives and disrupted how we do business, unlike any other technology in the history of mankind. Such disruption brings many challenges for professionals and businesses. In this article, I will provide an introduction to one of the most commonly used machine learning methods, K-Means. Machine learning is a scientific method that utilizes statistical methods along with the computational power of machines to convert data into wisdom that humans or the machine itself can use to take certain actions. "It is a branch of artificial intelligence based on the idea that systems can learn from data, identify patterns and make decisions with minimal human intervention."
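A minimal usage example of K-Means in practice, using scikit-learn's `KMeans` on toy data of our own (the group positions and sizes are arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

# two synthetic groups of 2-D points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, (40, 2)),
               rng.normal(3.0, 0.5, (40, 2))])

# fit k-means with k=2; n_init restarts guard against poor initializations
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # one center near (-3, -3), one near (3, 3)
print(km.inertia_)          # within-cluster sum of squared distances
```

`labels_` gives each point's cluster index, and `predict` assigns new points to the nearest learned center; choosing the number of clusters k is up to the user and is often guided by how `inertia_` changes as k grows.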